Tutorial by Katherine Wang and Kaitlin Sim
Pizza is an iconic staple for Americans and one of the world’s most popular foods. With such a large variety of pizza chains in America, the best and worst are hotly debated. Well-known chains like Domino’s and Pizza Hut are either loved or hated, so how do they stay so successful? In this tutorial, we will investigate factors that could potentially affect the success of a pizza chain by analyzing the menus and Yelp reviews of the ten highest-revenue pizza chains, using revenue as our measure of success. In order of highest to lowest revenue, these chains are Domino’s Pizza, Pizza Hut, Little Caesars Pizza, Papa John’s, Papa Murphy’s Pizza, California Pizza Kitchen, Marco’s Pizza, Chuck E. Cheese’s, Sbarro, and Round Table Pizza. We restrict our attention to these successful chains so that we can compare them and see whether they have anything in common that could be linked to their success.
We looked at several different datasets to analyze what makes a pizza chain successful. We web scraped the top ten pizza chain business data from Pizza Today's 2019 Top 100 Pizza Companies. We also used the Yelp Dataset to look at reviews and ratings of different chains and Datafiniti’s Pizza Restaurants and the Pizza They Sell dataset to look at menu items and prices.
Let’s first web scrape the top ten pizza chain data from Pizza Today's 2019 Top 100 Pizza Companies and parse it into a pandas DataFrame with the appropriate column names:
# !pip install folium
# !pip install plotly==5.4.0
import pandas as pd
import numpy as np
from datetime import datetime
import folium
import re
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import plotly.express as px
import itertools
# First, make an HTTP request to the Pizza Today to scrape the data
# Create a BeautifulSoup object with the html text and find the table within it using string slicing.
# Then, parse it into a pandas DataFrame with the respective column names.
response = requests.get("https://pizzatoday.com/pizzeria-rankings/2019-top-100-pizza-companies/")
text = response.text
soup = BeautifulSoup(text, 'html.parser')
pretty_soup = soup.prettify()
start_index = pretty_soup.find("<table class=\"top100\">")
end_index = pretty_soup.find("</table>") + 8
table = pretty_soup[start_index:end_index]
chain_df = pd.read_html(table)[0]
chain_df.columns = ['rank', 'name', 'delete', 'executive', 'location', 'num_units', 'revenue']
chain_df = chain_df.drop(['executive', 'location', 'delete'], axis=1)
# pizza chain name maps to its regex
top_pizza_chains = {
"Domino's Pizza": "Domino",
"Pizza Hut":"Pizza Hut",
"Little Caesars Pizza":"Little Caesar",
"Papa John's":"Papa John",
"California Pizza Kitchen":"California Pizza Kitchen",
"Papa Murphy's Pizza":"Papa Murphy",
"Sbarro":"Sbarro",
"Marco's Pizza":"Marco(.?)s", # "Marco's Restaurant" will be a problem
"Chuck E. Cheese's":"Chuck E",
"Round Table Pizza":"Round Table"
}
chain_names = list(top_pizza_chains.keys())
for name, reg in top_pizza_chains.items():
    name_to_replace = chain_df.loc[chain_df['name'].str.match(reg)]['name'].values[0]
    chain_df['name'].replace(name_to_replace, name, inplace=True)
top_chains_df = chain_df[:10].copy() # get only top 10 chains; copy() avoids SettingWithCopyWarning when we add columns later
top_chains_df
| | rank | name | num_units | revenue |
|---|---|---|---|---|
| 0 | 1 | Domino's Pizza | 15914 | $13,545,200,000 |
| 1 | 2 | Pizza Hut | 18431 | $12,212,000,000 |
| 2 | 3 | Little Caesars Pizza | 5465 | $4,770,500,000 |
| 3 | 4 | Papa John's | 5345 | $3,500,000,000 |
| 4 | 5 | Papa Murphy's Pizza | 1404 | $808,727,000 |
| 5 | 6 | California Pizza Kitchen | 244 | $785,000,000 |
| 6 | 7 | Marco's Pizza | 930 | $637,889,020 |
| 7 | 8 | Chuck E. Cheese's | 749 | $500,000,000 |
| 8 | 9 | Sbarro | 633 | $482,040,000 |
| 9 | 10 | Round Table Pizza | 460 | $470,000,000 |
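Slicing the prettified HTML works, but it silently depends on the `<table>` tag appearing exactly once with that class string. BeautifulSoup can locate the table node directly and hand it to `pd.read_html`, which is less fragile. A minimal sketch on a stand-in HTML snippet (the real markup on pizzatoday.com may differ):

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Stand-in HTML for the fetched page -- the real markup on pizzatoday.com may differ.
html = """
<html><body>
<table class="top100">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Domino's Pizza</td><td>$13,545,200,000</td></tr>
  <tr><td>2</td><td>Pizza Hut</td><td>$12,212,000,000</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# find() returns the table node itself, so no string slicing of prettified output is needed
table_node = soup.find("table", class_="top100")
df = pd.read_html(io.StringIO(str(table_node)))[0]
print(df)
```

`read_html` infers the header row from the `<th>` cells, so the resulting DataFrame already has `Rank`, `Name`, and `Revenue` columns.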
Now that we know what the top pizza chains by revenue are, we can begin cleaning and processing the other datasets. In the Yelp Dataset, we will only use yelp_academic_dataset_business.json and yelp_academic_dataset_review.json since we only want to look at review and star rating data.
In yelp_academic_dataset_review.json, there is a field called “business_id” instead of having a field for the business name, so we need to use yelp_academic_dataset_business.json, which also has the “business_id” field, to get the corresponding names of the businesses for each review. Before doing that, we will first clean the business dataset and only keep the pizza businesses. Now, we can merge both datasets into one dataframe.
Since both original datasets are quite large and this code takes a while to run, we will save the merged dataframe as a new csv file called yelp_reviews_Pizza_categories.csv for efficiency.
#### clean and merge Yelp business and review data. then save new dataframe as a csv ####
# kernel may die if you run this section in Jupyter Notebook
### CLEAN YELP BUSINESS DATASET ###
business_data_path = 'data/yelp_academic_dataset_business.json'
df_biz = pd.read_json(business_data_path, lines=True)
# only keep businesses that are still open (1 = open, 0 = closed)
df_biz = df_biz[df_biz['is_open'] == 1]
drop_columns = ['address', 'hours','is_open','review_count','attributes']
df_biz = df_biz.drop(drop_columns, axis=1)
pizza_biz = df_biz[df_biz['categories'].str.contains('Pizza', case=False, na=False)]
#view individual categories
df_biz_explode = pizza_biz.assign(categories = pizza_biz.categories
.str.split(', ')).explode('categories')
# print(pizza_biz)
### READ YELP REVIEW DATASET ###
b_pandas = []
rev_dtypes = {
'review_id':str,
'user_id':str,
'business_id':str,
'stars':np.float16,
'date':str,
'text':str,
'useful':np.int32,
'funny':np.int32,
'cool':np.int32
}
review_data_path = 'data/yelp_academic_dataset_review.json'
size = 1000000
df_review = pd.read_json(review_data_path, lines=True,
dtype=rev_dtypes,
chunksize=size)
### MERGE BUSINESS AND REVIEW DATA ###
# this part may take a while to run...
chunk_list = []
for chunk_review in df_review:
    # Drop columns that aren't needed
    chunk_review = chunk_review.drop(['user_id','review_id','useful','funny','cool'], axis=1)
    # Renaming column name to avoid conflict with business overall star rating
    chunk_review = chunk_review.rename(columns={'stars': 'review_stars'})
    # Inner merge with edited business file so only reviews related to the business remain
    chunk_merged = pd.merge(pizza_biz, chunk_review, on='business_id', how='inner')
    # Show feedback on progress
    print(f"{chunk_merged.shape[0]} out of {size:,} related reviews")
    chunk_list.append(chunk_merged)
# After trimming down the review file, concatenate all relevant data back to one dataframe
df = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)
# save df as a csv
csv_name = "yelp_reviews_Pizza_categories.csv"
df.to_csv(csv_name, index=False)
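The read-in-chunks, merge, concatenate pattern above generalizes to any JSON-lines file that is too large for memory. A self-contained sketch using a tiny temporary file (the field names here are illustrative, not the exact Yelp schema):

```python
import json
import os
import tempfile

import pandas as pd

# Tiny JSON-lines file standing in for the multi-gigabyte review dump.
# Field names are illustrative, not the exact Yelp schema.
rows = [{"business_id": f"b{i % 2}", "review_stars": float(i % 5 + 1)} for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "reviews.json")
with open(path, "w") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")

# Businesses we want to keep (the analogue of pizza_biz above).
keep = pd.DataFrame({"business_id": ["b0"], "name": ["Some Pizzeria"]})

# Read in chunks so the full file never has to fit in memory,
# inner-merge each chunk, and concatenate the surviving rows.
chunks = []
for chunk in pd.read_json(path, lines=True, chunksize=4):
    chunks.append(pd.merge(keep, chunk, on="business_id", how="inner"))
df_small = pd.concat(chunks, ignore_index=True)
print(df_small)
```

The inner merge discards every review whose `business_id` is not in the kept business table, which is exactly how the full pipeline trims the review dump down to pizza businesses.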
Now that we have the pizza subset of our merged Yelp data, we can do some more cleaning on the pizza chain data and also clean the Datafiniti dataset. Each row in the Datafiniti dataset represents a pizza from a restaurant and contains information about the business, prices, and more. After filtering out columns in both datasets that we didn’t need, we removed rows that were not related to the top ten pizza chains and standardized the spelling and punctuation of the chains’ names.
# load datasets
yelp_path = "data/yelp_reviews_Pizza_categories.csv"
pizza_path = "data/Datafiniti_Pizza_Restaurants_and_the_Pizza_They_Sell_May19.csv"
yelp_data = pd.read_csv(yelp_path)
pizza_data = pd.read_csv(pizza_path)
pizza_fields = [
'categories',
'city',
'country',
'latitude',
'longitude',
'menus.amountMax',
'menus.amountMin',
'menus.name',
'name',
'postalCode',
'priceRangeMin',
'priceRangeMax',
'province'
]
# filter out unnecessary fields
pizza_data = pizza_data[pizza_fields]
yelp_data = yelp_data.drop('business_id', axis=1)
yelp_data = yelp_data.loc[yelp_data['state'] != 'BC'] # BC is in Canada which we don't want
## problem: some of the names have different punctuation/have extra whitespace
## ie. "Papa John's" -> "Papa John's Pizza", "Papa John's Pizza ", 'Papa Johns'
## This function will replace these occurences with the proper name
def get_top_chains(df):
    res_names = sorted(list(df['name'].unique()))
    res_names_versions = {} # all different versions of each chain's name
    # some chain names are shortened or misspelled; find all occurrences/versions
    for name, reg in top_pizza_chains.items():
        pos_names = [x for x in res_names if re.search(reg, x)]
        if name == "Marco's Pizza" and "Marco's Restaurant" in pos_names: # handle manually for now -- needs a stricter regex
            pos_names.remove("Marco's Restaurant")
        res_names_versions[name] = pos_names
    # make all names consistent by replacing the different versions with the proper name
    for name, versions in res_names_versions.items():
        df['name'].replace(versions, name, inplace=True)
    # keep only rows from the top ten pizza chains
    df = df.loc[df['name'].isin(top_pizza_chains.keys())]
    return df
## Note: To narrow down our data, let's just look at the pizza chains in America with the highest revenue
# 2019 top 10 pizza chains by revenue according to Pizza Today
yelp_data_top_pizza = get_top_chains(yelp_data)
pizza_data_top = get_top_chains(pizza_data)
# check whether menus.amountMax and menus.amountMin are identical;
# if so, menus.amountMin is redundant and can be dropped
if pizza_data_top['menus.amountMin'].equals(pizza_data_top['menus.amountMax']):
    pizza_data_top = pizza_data_top.drop('menus.amountMin', axis=1)
pizza_data_top.rename(columns={'menus.amountMax':'menus.itemPrice'}, inplace=True)
print("Top pizza chain subset of Yelp data: \n", yelp_data_top_pizza.head())
print("\nTop pizza chain subset of Datafiniti data: \n", pizza_data_top.head())
Top pizza chain subset of Yelp data:
name city state postal_code latitude longitude \
117 Chuck E. Cheese's Orlando FL 32818 28.551335 -81.483167
118 Chuck E. Cheese's Orlando FL 32818 28.551335 -81.483167
119 Chuck E. Cheese's Orlando FL 32818 28.551335 -81.483167
120 Chuck E. Cheese's Orlando FL 32818 28.551335 -81.483167
121 Chuck E. Cheese's Orlando FL 32818 28.551335 -81.483167
stars categories review_stars \
117 2.0 Pizza, Event Planning & Services, Arcades, Par... 3.0
118 2.0 Pizza, Event Planning & Services, Arcades, Par... 1.0
119 2.0 Pizza, Event Planning & Services, Arcades, Par... 3.0
120 2.0 Pizza, Event Planning & Services, Arcades, Par... 5.0
121 2.0 Pizza, Event Planning & Services, Arcades, Par... 3.0
text date
117 Took my children and her friends there today..... 2013-11-25 23:14:15
118 very poorly managed.....dirty, bad food, machi... 2011-06-11 22:48:23
119 We went on a week night around 5 pm. The plac... 2015-12-02 12:01:19
120 I went here on a whim last night. We were ther... 2016-05-06 16:58:42
121 My in-laws are in town from Missouri (this is ... 2011-07-27 18:51:58
Top pizza chain subset of Datafiniti data:
categories city country latitude \
17 Restaurant,Pizza Place,Restaurants El Paso US 31.862409
41 Restaurant,Pizza Place Jacksonville US 30.204576
110 Restaurant,Pizza Place Iowa City US 41.649871
111 Restaurant,Pizza Place Iowa City US 41.649871
112 Restaurant,Pizza Place Iowa City US 41.649871
longitude menus.itemPrice \
17 -106.558431 5.00
41 -81.785252 7.99
110 -91.501576 13.00
111 -91.501576 13.00
112 -91.501576 13.00
menus.name name \
17 HotNReady Large Pizza Little Caesars Pizza
41 Large 3 Topping Pizza Little Caesars Pizza
110 Garden Vegetarian Pizza Baking Required Papa Murphy's Pizza
111 Medium Chicken Garlic Gluten Free Crust Pizza ... Papa Murphy's Pizza
112 Medium Cowboy Gluten Free Crust Pizza Baking R... Papa Murphy's Pizza
postalCode priceRangeMin priceRangeMax province
17 79912 0 25 TX
41 32257 0 25 FL
110 52240 0 25 IA
111 52240 0 25 IA
112 52240 0 25 IA
The top pizza chain subset of the Yelp data contain the below fields and each row represents a review:
Here are the fields in the cleaned top pizza chain subset of the Datafiniti dataset:
We noticed that the Datafiniti dataset contains very little data for Chuck E. Cheese’s. This could be because the Datafiniti dataset, which we downloaded from Kaggle, is only a sample of a larger Datafiniti dataset, which is unfortunately not free. Also, for some reason, all of the prices for the Chuck E. Cheese’s pizzas were zero. To resolve these issues, we manually added Chuck E. Cheese’s data to the Datafiniti data using the online ordering menu on Chuck E. Cheese’s website.
# manually set the price of Chuck E. Cheese's single existing menu item, then add the rest of its menu
pizza_data_top.loc[pizza_data_top['name'] == "Chuck E. Cheese's", ['menus.itemPrice']] = 9.49
chuckecheese_row = pizza_data_top.loc[pizza_data_top['name'] == "Chuck E. Cheese's"]
chuckecheese_itemprice_map = {
'large 5 meat pizza': 23.99,
'medium 5 meat pizza': 19.99,
'large supreme pizza': 23.99,
'medium supreme pizza': 19.99,
'large veggie pizza': 23.99,
'medium veggie pizza': 19.99,
'large cheese pizza': 18.99,
'medium cheese pizza': 15.99
}
for item, price in chuckecheese_itemprice_map.items():
    new_row = chuckecheese_row.assign(**{'menus.name': item, 'menus.itemPrice': price})
    pizza_data_top = pd.concat([pizza_data_top, new_row], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(pizza_data_top.head())
categories city country latitude \
0 Restaurant,Pizza Place,Restaurants El Paso US 31.862409
1 Restaurant,Pizza Place Jacksonville US 30.204576
2 Restaurant,Pizza Place Iowa City US 41.649871
3 Restaurant,Pizza Place Iowa City US 41.649871
4 Restaurant,Pizza Place Iowa City US 41.649871
longitude menus.itemPrice \
0 -106.558431 5.00
1 -81.785252 7.99
2 -91.501576 13.00
3 -91.501576 13.00
4 -91.501576 13.00
menus.name name \
0 HotNReady Large Pizza Little Caesars Pizza
1 Large 3 Topping Pizza Little Caesars Pizza
2 Garden Vegetarian Pizza Baking Required Papa Murphy's Pizza
3 Medium Chicken Garlic Gluten Free Crust Pizza ... Papa Murphy's Pizza
4 Medium Cowboy Gluten Free Crust Pizza Baking R... Papa Murphy's Pizza
postalCode priceRangeMin priceRangeMax province
0 79912 0 25 TX
1 32257 0 25 FL
2 52240 0 25 IA
3 52240 0 25 IA
4 52240 0 25 IA
After cleaning and parsing our datasets, we can now begin analyzing our data and creating visualizations.
Let’s first look at the Yelp reviews. We want to analyze this because we want to see if existing opinions have any effect on a pizza chain’s success. For this section, we looked at the review star ratings of each chain and calculated the ratios and counts of each star rating and average star rating of each pizza chain.
# Let's look at the yelp reviews
# first, split up yelp_data_top_pizza by each chain using the keys of top_pizza_chains
# top_chains = list(top_pizza_chains.keys())
yelp_data_by_chain = list(map(lambda n: yelp_data_top_pizza.loc[yelp_data_top_pizza['name'] == n], chain_names))
review_star_ratios = {} # ratio of review number of stars (out of 5 stars)
review_star_counts = {}
for chain_df in yelp_data_by_chain:
    chain = chain_df['name'].unique()[0]
    star_counts_ratios = dict(chain_df['review_stars'].value_counts(normalize=True).sort_index())
    star_counts = dict(chain_df['review_stars'].value_counts().sort_index())
    review_star_ratios[chain] = star_counts_ratios
    review_star_counts[chain] = star_counts
# reorganize data so that it's easier to plot
ratio_data = [{"name": name, "stars": stars, "ratio": ratio} for name in review_star_ratios for stars, ratio in review_star_ratios[name].items()]
review_star_ratios_df = pd.DataFrame.from_dict(ratio_data)
count_data = [{"name": name, "stars": stars, "count": count} for name in review_star_counts for stars, count in review_star_counts[name].items()]
review_star_count_df = pd.DataFrame.from_dict(count_data)
# this dataframe contains ratio and count data for star rating frequencies
review_star_df = review_star_ratios_df.copy()
review_star_df['count'] = review_star_count_df['count']
review_star_df['stars'] = review_star_df['stars'].apply(int)
review_star_df['stars'] = review_star_df['stars'].apply(str)
review_star_df['avg_stars'] = 0
top_chains_df['avg_stars'] = 0
# find average overall rating for each chain and add to top_chains_df
for chain, stars in review_star_counts.items():
    total_num_reviews = sum(stars.values())
    total_stars = 0
    for star, count in stars.items():
        total_stars += star * count
    avg_star_rating = total_stars/total_num_reviews
    review_star_df.loc[review_star_df['name'] == chain, ['avg_stars']] = avg_star_rating
    top_chains_df.loc[top_chains_df['name'] == chain, ['avg_stars']] = avg_star_rating
print(review_star_df.head())
print(top_chains_df.head())
             name stars     ratio  count  avg_stars
0  Domino's Pizza     1  0.579732   3461    2.21541
1  Domino's Pizza     2  0.081407    486    2.21541
2  Domino's Pizza     3  0.062647    374    2.21541
3  Domino's Pizza     4  0.096147    574    2.21541
4  Domino's Pizza     5  0.180067   1075    2.21541

   rank                  name  num_units          revenue  avg_stars
0     1        Domino's Pizza      15914  $13,545,200,000   2.215410
1     2             Pizza Hut      18431  $12,212,000,000   1.974376
2     3  Little Caesars Pizza       5465   $4,770,500,000   2.468610
3     4           Papa John's       5345   $3,500,000,000   2.289688
4     5   Papa Murphy's Pizza       1404     $808,727,000   3.278880
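The explicit loop above makes the arithmetic visible, but the same per-chain averages fall out of a single groupby. A sketch on toy data (column names mirror the merged Yelp data):

```python
import pandas as pd

# Toy reviews; column names mirror the merged Yelp data above.
toy = pd.DataFrame({
    "name": ["Chain A", "Chain A", "Chain A", "Chain B", "Chain B"],
    "review_stars": [1.0, 2.0, 5.0, 4.0, 4.0],
})

# One groupby replaces the per-chain counting loop.
avg_stars = toy.groupby("name")["review_stars"].mean()
print(avg_stars)
```

Here Chain A averages (1 + 2 + 5) / 3 ≈ 2.67 stars and Chain B averages 4.0, exactly what the loop would produce.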
# let's visualize the frequency of each star rating (1 through 5 stars) for each chain using a stacked bar chart
fig = px.bar(review_star_df, x="name", y="count", color="stars",
labels={
"name":"Pizza Chain by Revenue (highest to lowest)",
"count":"Number of Ratings",
"stars":"Number of Stars (out of 5)"
})
fig.update_layout(title_text="Frequency of Review Star Ratings by Chain", title_x=0.5)
fig.show()
# let's visualize the ratios of number of reviews with 1 through 5 stars for each chain using a scatter plot
fig = px.scatter(review_star_df, x="name", y="ratio", color="stars",
labels={
"name":"Pizza Chain by Revenue (highest to lowest)",
"ratio":"Star Ratio",
"stars":"Number of Stars (out of 5)"
})
fig.update_layout(title_text="Ratio of Review Star Ratings by Chain", title_x=0.5)
fig.show()
In both plots, it is clear that for most pizza chains the most common star rating is one star, even though these are the chains with the highest revenue. This could be because people tend to write reviews only when they have had a particularly good or bad experience. However, the most common rating is four stars for California Pizza Kitchen and five stars for Papa Murphy’s Pizza and Marco’s Pizza. It is thus possible that these three chains owe part of their high revenue to good ratings, since potential customers who see that a business is well reviewed are more inclined to visit. It is also interesting to note from the first plot (Frequency of Review Star Ratings by Chain) that the number of reviews generally decreases as revenue decreases. It is possible that the more reviews a chain has, the more customers it has, which leads to higher revenue. We will need to dig deeper to see whether review star ratings actually have an effect on a chain’s success.
# plot average star rating vs revenue
top_chains_df_reverse = top_chains_df.iloc[::-1]
fig_star = px.scatter(top_chains_df_reverse, x="revenue", y="avg_stars", color="name",
labels={
"name":"Pizza Chain by Revenue (low to high)",
"revenue":"Revenue (USD)",
"avg_stars":"Average Star Rating"
},
title="Average Star Rating (Out of 5 Stars) VS. Revenue")
fig_star.show()
The overall Yelp star ratings for the top pizza chains range from 1.97 to 3.30. From the graph, the first four and the last three of the top ten chains are rated relatively low compared to the middle three. This might make you wonder: why are the top two chains, Domino's and Pizza Hut, rated so low? And why are the middle three, Papa Murphy's Pizza, California Pizza Kitchen, and Marco's Pizza, rated so high?
There are some factors to consider about the nature of Yelp reviews in order to understand why these businesses are so successful regardless of their Yelp ratings. Yelp reviews mostly describe notably good or bad experiences, with fewer mediocre or merely satisfied reviews. A person is more likely to go out of their way to review a restaurant if they have an overwhelmingly positive or negative experience with its food, atmosphere, or staff. Other factors, such as fake or faulty reviews, could also contribute to these results.
The most successful pizza chain in terms of revenue is Domino's, and it has the most units in the country. So why is Domino's average star rating only 2.22 out of 5 on Yelp? With Domino's being such a well-known restaurant selling millions of pizzas a day, most people likely have a good or average experience that wouldn't prompt them to write a Yelp review. However, if a customer has a negative experience with the food quality or the staff at Domino's, or at any well-known restaurant, they are more likely to write a review about it, because these top businesses are expected to uphold high standards and maintain a good reputation, while a lesser-known restaurant does not face the same expectations.
This still leaves the question of why Papa Murphy's Pizza, California Pizza Kitchen, and Marco's Pizza have such high average star ratings on Yelp. After more research, it is clear that each of these pizzerias has something special that makes it stand out from the average fast-food pizza chain. Papa Murphy's is famous for its take-and-bake pizzas, its huge variety of items, and its fresh, quality ingredients. California Pizza Kitchen is famous for innovative pizza creations such as its Thai Chicken Pizza. Finally, Marco's Pizza is the only major pizza chain in the US founded by a native Italian; it is known for its quality and atmosphere and has had two award-winning pizzas.
We will plot all of the locations of the top pizza chains in the US.
The marker color for each pizza chain is given by the color map defined in the code below.
# plot locations using Datafiniti and Yelp data
## combine Datafiniti and Yelp data locations into one dataframe ##
# need to remove duplicates of the same restaurant locations
pizza_data_unique_locs = pizza_data_top.drop_duplicates(subset=['latitude', 'longitude'])
yelp_data_unique_locs = yelp_data_top_pizza.drop_duplicates(subset=['latitude', 'longitude'])
pizza_data_unique_locs = pizza_data_unique_locs[['name', 'latitude', 'longitude', 'city', 'province']]
yelp_data_unique_locs = yelp_data_unique_locs[['name', 'latitude', 'longitude', 'city', 'state']]
pizza_data_unique_locs = pizza_data_unique_locs.rename(columns={'province':'state'}) # avoid an inplace rename on a slice
all_locs = pd.concat([pizza_data_unique_locs, yelp_data_unique_locs])
all_locs = all_locs.drop_duplicates(subset=['latitude', 'longitude'])
# create color map {chain:color}
colors = ['blue','red','orange','darkgreen','beige','gray','lightgreen','black','purple','pink']
color_map = {chain_names[i] : colors[i] for i in range(len(chain_names))}
# create map of the US (including Alaska)
map_osm = folium.Map(location=[48, -102], zoom_start=3)
# plot locations on map for top ten chains in pizza_data
for chain in top_pizza_chains:
    all_locs.loc[all_locs['name'] == chain].apply(
        lambda x:
        folium.Marker(
            location=[x.latitude, x.longitude],
            popup=x['name'] + " : " + x['city'],
            icon=folium.Icon(color=color_map[chain])
        ).add_to(map_osm), axis=1)
map_osm
We also plotted the top pizza chain locations in the top twenty most populated cities in the US (list of cities taken from World Population Review).
# plot locations on map for top chains in Datafiniti and Yelp data (all_loc) in major cities
# top 20 most populated cities in 2021
major_cities = {
'New York':'NY',
'Los Angeles':'CA',
'Chicago':'IL',
'Houston':'TX',
'Phoenix':'AZ',
'Philadelphia':'PA',
'San Antonio':'TX',
'San Diego':'CA',
'Dallas':'TX',
'Austin':'TX',
'San Jose':'CA',
'Fort Worth':'TX',
'Jacksonville':'FL',
'Columbus':'OH',
'Charlotte':'NC',
'Indianapolis':'IN',
'San Francisco':'CA',
'Seattle':'WA',
'Denver':'CO',
'Washington':'DC' # there is no DC anyway...
}
# get rows only from major cities
pizza_data_major_cities = all_locs.loc[all_locs['city'].isin(major_cities.keys())]
# create map of the US (including Alaska)
map_osm_major_cities = folium.Map(location=[35, -102], zoom_start=4)
# plot locations on map for top chains in pizza_data in major cities
for chain in top_pizza_chains:
    pizza_data_major_cities.loc[pizza_data_major_cities['name'] == chain].apply(
        lambda x:
        folium.Marker(
            location=[x.latitude, x.longitude],
            popup=x['name'] + " : " + x['city'],
            icon=folium.Icon(color=color_map[chain])
        ).add_to(map_osm_major_cities), axis=1)
map_osm_major_cities
When looking at the locations in the most populated cities in the US compared to all of the locations in the US as a whole, it is clear that the majority of the pizzerias are not in major cities. However, since the major cities are the most populated in the country, locations in these cities are more likely to bring in more customers and more revenue.
# count the number of major-city locations that each of the top ten pizza chains has
num_chains_major_cities = {
'Domino\'s Pizza': 0,
'Pizza Hut': 0,
'Little Caesars Pizza': 0,
'Papa John\'s': 0,
'Papa Murphy\'s Pizza': 0,
'California Pizza Kitchen': 0,
'Marco\'s Pizza': 0,
'Chuck E. Cheese\'s': 0,
'Sbarro': 0,
'Round Table Pizza': 0
}
# count each chain's major-city locations, then convert the counts to percentages
for chain in num_chains_major_cities:
    num_chains_major_cities[chain] = len(pizza_data_major_cities[pizza_data_major_cities['name'] == chain])
for chain, num in num_chains_major_cities.items():
    num_chains_major_cities[chain] = num / len(all_locs[all_locs['name'] == chain]) * 100
data_locs = {'chain': num_chains_major_cities.keys(), 'percent locations that are major cities': num_chains_major_cities.values()}
plot_locs = pd.DataFrame(data_locs)
print(plot_locs.head())
# create plot
fig_locs = px.bar(plot_locs, x="chain", y="percent locations that are major cities", color="chain",
labels={
"chain":"Pizza Chain by Revenue (highest to lowest)"
},
title="Percentage of Locations in Major Cities by Chain")
fig_locs.show()
                  chain  percent locations that are major cities
0        Domino's Pizza                                15.865385
1             Pizza Hut                                14.893617
2  Little Caesars Pizza                                14.678899
3           Papa John's                                17.751479
4   Papa Murphy's Pizza                                 8.219178
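The dictionary bookkeeping above can also be collapsed with `value_counts`: dividing the two count Series aligns them on the chain name automatically. A sketch on toy location data:

```python
import pandas as pd

# Toy stand-ins for all_locs and pizza_data_major_cities above.
all_locs_toy = pd.DataFrame({"name": ["Chain A"] * 4 + ["Chain B"] * 2})
major_toy = pd.DataFrame({"name": ["Chain A", "Chain B", "Chain B"]})

# value_counts gives per-chain totals; the division aligns the two Series on the chain name.
pct_major = (major_toy["name"].value_counts()
             / all_locs_toy["name"].value_counts() * 100)
print(pct_major.sort_index())
```

Chain A has 1 of 4 locations in major cities (25%), Chain B has 2 of 2 (100%); a chain with no major-city rows would come out as NaN rather than 0, which is worth handling with `fillna(0)` on real data.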
As you go from left to right (highest to lowest revenue) across the chains, you'll notice a general decrease in the percentage of locations that are in major cities (versus non-major cities in the US), with a few outliers. Perhaps the more revenue a pizza chain has, the more locations it can open in major cities in order to make even more money.
The three pizza chains that stand out from the trend are Papa John's, California Pizza Kitchen, and Chuck E. Cheese's. One speculation for why Papa John's has so many major-city locations but less revenue than the top three chains is its history of success followed by the scandals and controversial remarks that earned Papa John's and its founder a bad reputation, especially in 2017-2018 (only a year or two before this 2019 revenue data).
A speculation of why California Pizza Kitchen has so many locations in major cities is because the headquarters is in Los Angeles, CA which is the second most populated city in the US. It is expected for more locations of a chain to exist in the same city as its headquarters.
A speculation of why Chuck E. Cheese's has so many locations in major cities, but has lower revenue (from their pizza distributor Peter Piper Pizza) is due to the fact that Chuck E. Cheese's main business service pertains to children's entertainment, and pizza is just a food service that they provide at their locations. Therefore, Chuck E. Cheese would most likely be in major cities where there are more children and families as their target audience.
Now we will analyze menu item prices for each pizza chain.
# plot menu item prices by chain
# drop rows with prices greater than $100 since these are catering/party packs
pizza_data_top_clean = pizza_data_top.copy()
pizza_data_top_clean = pizza_data_top_clean.loc[pizza_data_top_clean['menus.itemPrice'] < 100]
# add revenue for each chain to pizza_data_top_clean
pizza_data_top_clean['revenue'] = 0
for name in chain_names:
    rev = top_chains_df['revenue'].loc[top_chains_df['name'] == name].item()
    pizza_data_top_clean.loc[pizza_data_top_clean['name'] == name, 'revenue'] = rev
# strip "$" and "," so the revenue column can be converted to int
pizza_data_top_clean['revenue'] = pizza_data_top_clean['revenue'].str.replace(r"\$|,", "", regex=True)
pizza_data_top_clean['revenue'] = pizza_data_top_clean['revenue'].apply(int)
print(pizza_data_top_clean.head())
pizza_data_by_chain = list(map(lambda n: pizza_data_top_clean.loc[pizza_data_top_clean['name'] == n], chain_names))
fig = px.box(pizza_data_top_clean, x="name", y="menus.itemPrice",
labels={
"name":"Pizza Chain by Revenue (highest to lowest)",
"menus.itemPrice":"Item Price (USD)"
},
category_orders={"name": chain_names})
fig.update_layout(title_text="Menu Item Price Distribution by Pizza Chain", title_x=0.5)
fig.show()
categories city country latitude \
0 Restaurant,Pizza Place,Restaurants El Paso US 31.862409
1 Restaurant,Pizza Place Jacksonville US 30.204576
2 Restaurant,Pizza Place Iowa City US 41.649871
3 Restaurant,Pizza Place Iowa City US 41.649871
4 Restaurant,Pizza Place Iowa City US 41.649871
longitude menus.itemPrice \
0 -106.558431 5.00
1 -81.785252 7.99
2 -91.501576 13.00
3 -91.501576 13.00
4 -91.501576 13.00
menus.name name \
0 HotNReady Large Pizza Little Caesars Pizza
1 Large 3 Topping Pizza Little Caesars Pizza
2 Garden Vegetarian Pizza Baking Required Papa Murphy's Pizza
3 Medium Chicken Garlic Gluten Free Crust Pizza ... Papa Murphy's Pizza
4 Medium Cowboy Gluten Free Crust Pizza Baking R... Papa Murphy's Pizza
postalCode priceRangeMin priceRangeMax province revenue
0 79912 0 25 TX 4770500000
1 32257 0 25 FL 4770500000
2 52240 0 25 IA 808727000
3 52240 0 25 IA 808727000
4 52240 0 25 IA 808727000
From the box plot, we can see the general range of menu item prices for the top pizza chains. One attribute we notice is that most of the first quartiles are at or above $10; thus, roughly three-quarters of the items at those restaurants are priced at $10 or more. The three chains with a first quartile below $10 are Little Caesars Pizza, Sbarro, and Round Table Pizza.
Let's look into how these three chains are still able to make a profit despite their low prices. They may manage to be cheap yet high-revenue through discounted ingredients, careful portioning, convenient locations, and other tactics.
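The first-quartile observation can be checked numerically rather than read off the box plot: `groupby(...).quantile(0.25)` gives Q1 per chain directly. A sketch on toy data (column names mirror the Datafiniti data):

```python
import pandas as pd

# Toy menu prices; column names mirror the Datafiniti data above.
toy = pd.DataFrame({
    "name": ["Cheap Chain"] * 4 + ["Pricey Chain"] * 4,
    "menus.itemPrice": [5.0, 6.0, 8.0, 12.0, 11.0, 13.0, 15.0, 20.0],
})

# First quartile (Q1) of item price per chain, then the chains whose Q1 is under $10.
q1 = toy.groupby("name")["menus.itemPrice"].quantile(0.25)
below_10 = sorted(q1[q1 < 10].index)
print(q1)
print(below_10)
```

With pandas' default linear interpolation, Q1 for the cheap chain is 5.75 and for the pricey chain 12.5, so only the cheap chain is flagged, matching how the three sub-$10 chains were identified above.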
Now that we've analyzed star ratings, location, and prices, let's see whether a predictive relationship exists between median menu item price and total revenue using linear regression. We used the median menu item price because the median is a more robust measure of central tendency than the mean: as the box plot above shows, many chains' price distributions contain outliers.
# plot median menu item prices by chain vs revenue
pizza_data_by_chain = list(map(lambda n: pizza_data_top_clean.loc[pizza_data_top_clean['name'] == n], chain_names))
# find median menu item price for each chain and add to top_chains_df
top_chains_df['median_price'] = np.nan
for df in pizza_data_by_chain:
    chain = df['name'].unique()[0]
    med_price = df['menus.itemPrice'].median()
    top_chains_df.loc[top_chains_df['name'] == chain, ['median_price']] = med_price
# make values in top_chains_df's revenue column an int
top_chains_df['revenue'] = top_chains_df['revenue'].str.replace("\$|,", "")
top_chains_df['revenue'] = top_chains_df['revenue'].apply(int)
print(top_chains_df.head())
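The revenue-cleaning step above can be sketched in isolation (the values below are hypothetical stand-ins for the scraped strings):

```python
import pandas as pd

# Hypothetical revenue strings in the scraped "$#,###" format
revenue = pd.Series(["$13,545,200,000", "$808,727,000"])

# Strip "$" and "," with a regex, then convert to integers;
# passing regex=True avoids the deprecation warning in newer pandas
cleaned = revenue.str.replace(r"\$|,", "", regex=True).astype(int)
print(cleaned.tolist())
```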
# plot median menu item prices by chain vs revenue
fig = px.scatter(top_chains_df, x="revenue", y="median_price", color="name",
                 labels={
                     "revenue": "Revenue of Top Pizza Chains (USD)",
                     "median_price": "Median Item Price (USD)",
                     "name": "Pizza Chain by Revenue (high to low)"
                 },
                 category_orders={"name": chain_names},
                 trendline="ols",
                 trendline_scope="overall",
                 trendline_color_override="black")
fig.update_layout(title_text="Median Menu Item Price vs Revenue", title_x=0.5)
fig.show()
# show regression results
model = px.get_trendline_results(fig)
results = model.iloc[0]["px_fit_results"].summary()
results
| | rank | name | num_units | revenue | avg_stars | median_price |
|---|---|---|---|---|---|---|
| 0 | 1 | Domino's Pizza | 15914 | 13545200000 | 2.215410 | 12.99 |
| 1 | 2 | Pizza Hut | 18431 | 12212000000 | 1.974376 | 15.00 |
| 2 | 3 | Little Caesars Pizza | 5465 | 4770500000 | 2.468610 | 7.99 |
| 3 | 4 | Papa John's | 5345 | 3500000000 | 2.289688 | 13.99 |
| 4 | 5 | Papa Murphy's Pizza | 1404 | 808727000 | 3.278880 | 14.00 |
| Dep. Variable: | y | R-squared: | 0.024 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | -0.098 |
| Method: | Least Squares | F-statistic: | 0.1977 |
| Date: | Mon, 20 Dec 2021 | Prob (F-statistic): | 0.668 |
| Time: | 04:29:31 | Log-Likelihood: | -28.644 |
| No. Observations: | 10 | AIC: | 61.29 |
| Df Residuals: | 8 | BIC: | 61.89 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 11.6708 | 1.912 | 6.103 | 0.000 | 7.261 | 16.081 |
| x1 | 1.398e-10 | 3.14e-10 | 0.445 | 0.668 | -5.85e-10 | 8.65e-10 |
| Omnibus: | 0.663 | Durbin-Watson: | 2.618 |
|---|---|---|---|
| Prob(Omnibus): | 0.718 | Jarque-Bera (JB): | 0.006 |
| Skew: | -0.055 | Prob(JB): | 0.997 |
| Kurtosis: | 2.944 | Cond. No. | 7.75e+09 |
The regression line shows a slight positive correlation between median item price and revenue. However, the points fit the line poorly, and the R-squared value (0.024) is very low. This means there isn't truly a predictive relationship between a chain's median menu item price and its revenue. One likely reason is the lower six chains in the top ten, which together span a wide range of median prices. Factors besides price must therefore contribute to those chains' success, particularly for Sbarro and Chuck E. Cheese's, which lie furthest from the regression line.
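For a one-variable regression like this, R-squared is simply the squared Pearson correlation between the two variables, so the weak fit can be checked directly. A minimal sketch with hypothetical (revenue, median price) pairs, not the full ten-chain data:

```python
import numpy as np

# Hypothetical (revenue, median_price) pairs illustrating a weak fit
revenue = np.array([13.5e9, 12.2e9, 4.8e9, 3.5e9, 0.8e9])
median_price = np.array([12.99, 15.00, 7.99, 13.99, 14.00])

# For simple linear regression, R^2 equals the squared Pearson correlation
r = np.corrcoef(revenue, median_price)[0, 1]
print(r ** 2)
```

An R-squared near 0 means the regression line explains almost none of the variation in median price.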
# plot median menu item prices by chain vs average star rating
fig = px.scatter(top_chains_df, x="avg_stars", y="median_price", color="name",
                 labels={
                     "avg_stars": "Average Star Rating of Top Pizza Chains",
                     "median_price": "Median Item Price (USD)",
                     "name": "Pizza Chain by Revenue (high to low)"
                 },
                 category_orders={"name": chain_names},
                 trendline="ols",
                 trendline_scope="overall",
                 trendline_color_override="black")
fig.update_layout(title_text="Median Menu Item Price vs Average Star Rating", title_x=0.5)
fig.show()
# show regression results
model = px.get_trendline_results(fig)
results = model.iloc[0]["px_fit_results"].summary()
results
| Dep. Variable: | y | R-squared: | 0.035 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | -0.086 |
| Method: | Least Squares | F-statistic: | 0.2866 |
| Date: | Mon, 20 Dec 2021 | Prob (F-statistic): | 0.607 |
| Time: | 04:29:31 | Log-Likelihood: | -28.590 |
| No. Observations: | 10 | AIC: | 61.18 |
| Df Residuals: | 8 | BIC: | 61.79 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 7.5019 | 8.898 | 0.843 | 0.424 | -13.017 | 28.020 |
| x1 | 1.7943 | 3.352 | 0.535 | 0.607 | -5.934 | 9.523 |
| Omnibus: | 0.300 | Durbin-Watson: | 1.809 |
|---|---|---|---|
| Prob(Omnibus): | 0.861 | Jarque-Bera (JB): | 0.106 |
| Skew: | -0.176 | Prob(JB): | 0.949 |
| Kurtosis: | 2.640 | Cond. No. | 18.0 |
The regression line on the plot shows a slight positive correlation between median menu item price and a chain's average star rating. However, the R-squared value (0.035) is very low, meaning there isn't truly a correlation between these two variables. We believe the slope of the regression line is driven by the Sbarro and Chuck E. Cheese's points, which sit at the extremes while all the other points cluster near the center of the graph. The lack of a predictive relationship between median price and average star rating suggests that price does not have as much of an effect on a reviewer's star rating as we thought; other factors must go into a reviewer's rating of a pizza chain.
# plot average star rating by chain vs revenue
fig = px.scatter(top_chains_df, x="revenue", y="avg_stars", color="name",
                 labels={
                     "avg_stars": "Average Star Rating of Top Pizza Chains",
                     "revenue": "Revenue (USD)",
                     "name": "Pizza Chain by Revenue (high to low)"
                 },
                 category_orders={"name": chain_names},
                 trendline="ols",
                 trendline_scope="overall",
                 trendline_color_override="black")
fig.update_layout(title_text="Average Star Rating vs Revenue", title_x=0.5)
fig.show()
# show regression results
model = px.get_trendline_results(fig)
results = model.iloc[0]["px_fit_results"].summary()
results
| Dep. Variable: | y | R-squared: | 0.429 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.358 |
| Method: | Least Squares | F-statistic: | 6.008 |
| Date: | Mon, 20 Dec 2021 | Prob (F-statistic): | 0.0399 |
| Time: | 04:29:32 | Log-Likelihood: | -3.2978 |
| No. Observations: | 10 | AIC: | 10.60 |
| Df Residuals: | 8 | BIC: | 11.20 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 2.8477 | 0.152 | 18.781 | 0.000 | 2.498 | 3.197 |
| x1 | -6.11e-11 | 2.49e-11 | -2.451 | 0.040 | -1.19e-10 | -3.62e-12 |
| Omnibus: | 2.429 | Durbin-Watson: | 1.059 |
|---|---|---|---|
| Prob(Omnibus): | 0.297 | Jarque-Bera (JB): | 0.908 |
| Skew: | 0.131 | Prob(JB): | 0.635 |
| Kurtosis: | 1.547 | Cond. No. | 7.75e+09 |
Based on the slope of the regression line and the R-squared value (0.429), there appears to be a modest predictive relationship between a chain's average star rating and its revenue: as revenue increases, the predicted average star rating decreases. It is interesting that the more successful a restaurant is in terms of revenue, the worse its Yelp ratings tend to be. We had originally hypothesized the opposite: that a more successful restaurant should have a higher Yelp star rating. As we've mentioned before, this result makes sense given how Yelp reviews tend to work.
Most people who have a good or merely average experience at a big chain have little reason to write a Yelp review about it. A customer who has a negative experience with the food or the staff, however, is more likely to write one, because well-known chains are expected to uphold high standards and a good reputation in a way that lesser-known restaurants are not.
Based on our analysis, we found that menu item prices and locations could potentially affect a pizza chain's success. In some cases, such as Little Caesars Pizza and Sbarro, low menu item prices may help the chain generate more revenue. It is also possible that the more major-city locations a pizza chain has, the more revenue it can generate. We also found that Yelp reviews do not accurately reflect which businesses are successful, and that factors other than price go into a Yelp review rating. Many factors can make a pizza chain successful, and exploring them fully would take much more than a single tutorial and a couple of datasets. For further analysis on this topic, it would be interesting to examine specific menu items, ingredient quality, partnering retailers, and customer demographics; we did not cover these attributes in this tutorial due to the lack of (free) data available.